Results 1 - 20 of 76
1.
IEEE Trans Image Process ; 33: 2279-2292, 2024.
Article in English | MEDLINE | ID: mdl-38478437

ABSTRACT

In this paper, we propose an anycost network quantization method for efficient image super-resolution under variable resource budgets. Conventional quantization approaches acquire discrete network parameters for deployment under fixed complexity constraints, whereas image super-resolution networks are usually applied on mobile devices whose resource budgets change frequently with battery levels or computing chips. Hence, exhaustively optimizing quantized networks for each complexity constraint results in unacceptable training costs. Instead, we construct a hyper-network whose parameters can efficiently adapt to different resource budgets with negligible finetuning cost, so that image super-resolution networks can feasibly be deployed on diverse devices with variable resource budgets. Specifically, we dynamically search the optimal bitwidth for each patch in convolution according to the feature maps and complexity constraints, aiming at the best efficiency-accuracy trade-off in image super-resolution for the given resource budget. To acquire a hyper-network that can be efficiently adapted to different bitwidth settings, we actively sample the patch-wise bitwidths during training and adaptively ensemble gradients from the hyper-network at different precisions for faster convergence and better generalization. Experimental results demonstrate that, compared with existing quantization methods, our method significantly reduces the cost of adapting models to new resource budgets while achieving comparable efficiency-accuracy trade-offs.
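
As a rough illustration of the patch-wise variable-bitwidth idea described in this abstract, the following minimal PyTorch sketch applies symmetric uniform quantization to each spatial patch of a feature map with its own bitwidth; the function names, patch size, and bitwidth policy are hypothetical and not taken from the paper.

```python
import torch

def quantize_uniform(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric uniform quantization of a tensor to `bits` bits (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.round(x / scale).clamp(-qmax, qmax) * scale

def quantize_patchwise(feat: torch.Tensor, bit_map: torch.Tensor, patch: int = 8) -> torch.Tensor:
    """Quantize each spatial patch of a feature map with its own bitwidth.

    feat:    (C, H, W) feature map
    bit_map: (H // patch, W // patch) integer bitwidths chosen by some policy
    """
    out = feat.clone()
    for i in range(bit_map.shape[0]):
        for j in range(bit_map.shape[1]):
            ys, xs = i * patch, j * patch
            out[:, ys:ys + patch, xs:xs + patch] = quantize_uniform(
                feat[:, ys:ys + patch, xs:xs + patch], int(bit_map[i, j]))
    return out

feat = torch.randn(64, 32, 32)
bit_map = torch.randint(2, 9, (4, 4))   # e.g. bitwidths sampled under a budget
print(quantize_patchwise(feat, bit_map).shape)   # torch.Size([64, 32, 32])
```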

2.
IEEE Trans Image Process ; 33: 1016-1031, 2024.
Article in English | MEDLINE | ID: mdl-38265893

ABSTRACT

In this paper, we present a Structure-aware Cross-Modal Transformer (SCMT) to fully capture the 3D structures hidden in sparse depths for depth completion. Most existing methods learn to predict dense depths by taking depths as an additional channel of RGB images or learning 2D affinities to perform depth propagation. However, they fail to exploit 3D structures implied in the depth channel, thereby losing the informative 3D knowledge that provides important priors to distinguish the foreground and background features. Moreover, since these methods rely on the color textures of 2D images, it is challenging for them to handle poor-texture regions without the guidance of explicit 3D cues. To address this, we disentangle the hierarchical 3D scene-level structure from the RGB-D input and construct a pathway to make sharp depth boundaries and object shape outlines accessible to 2D features. Specifically, we extract 2D and 3D features from depth inputs and the back-projected point clouds respectively by building a two-stream network. To leverage 3D structures, we construct several cross-modal transformers to adaptively propagate multi-scale 3D structural features to the 2D stream, energizing 2D features with priors of object shapes and local geometries. Experimental results show that our SCMT achieves state-of-the-art performance on three popular outdoor (KITTI) and indoor (VOID and NYU) benchmarks.
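
The cross-modal transformer described above propagates 3D structural features into the 2D stream. A generic cross-attention sketch of that idea in PyTorch is shown below, where 2D tokens act as queries and back-projected 3D tokens as keys and values; the dimensions and module structure are illustrative assumptions, not the authors' SCMT layer.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Illustrative cross-attention block: 2D tokens query 3D structural tokens."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, feat2d: torch.Tensor, feat3d: torch.Tensor) -> torch.Tensor:
        # feat2d: (B, N2d, C) flattened image features; feat3d: (B, N3d, C) point features
        fused, _ = self.attn(query=feat2d, key=feat3d, value=feat3d)
        x = self.norm1(feat2d + fused)
        return self.norm2(x + self.ffn(x))

block = CrossModalBlock()
print(block(torch.randn(2, 1024, 128), torch.randn(2, 512, 128)).shape)  # (2, 1024, 128)
```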

3.
IEEE Trans Pattern Anal Mach Intell ; 46(6): 4381-4397, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38227416

ABSTRACT

Nowadays, pre-training big models on large-scale datasets has achieved great success and dominated many downstream tasks in natural language processing and 2D vision, while pre-training in 3D vision is still under development. In this paper, we provide a new perspective on transferring pre-trained knowledge from the 2D domain to the 3D domain with Point-to-Pixel Prompting in data space and Pixel-to-Point distillation in feature space, exploiting the shared knowledge in images and point clouds that depict the same visual world. Following the principle of prompt engineering, Point-to-Pixel Prompting transforms point clouds into colorful images with geometry-preserved projection and geometry-aware coloring. The pre-trained image models can then be directly applied to point cloud tasks without structural changes or weight modifications. With projection correspondence in feature space, Pixel-to-Point distillation further regards the pre-trained image model as the teacher and distills pre-trained 2D knowledge to student point cloud models, remarkably enhancing inference efficiency and model capacity for point cloud analysis. We conduct extensive experiments on both object classification and scene segmentation under various settings to demonstrate the superiority of our method. In object classification, we reveal the important scale-up trend of Point-to-Pixel Prompting and attain 90.3% accuracy on the ScanObjectNN dataset, surpassing previous literature by a large margin. In scene-level semantic segmentation, our method outperforms traditional 3D analysis approaches and shows competitive capacity in dense prediction tasks.

4.
IEEE Trans Pattern Anal Mach Intell ; 46(4): 1964-1980, 2024 Apr.
Article in English | MEDLINE | ID: mdl-37669195

ABSTRACT

This paper proposes an introspective deep metric learning (IDML) framework for uncertainty-aware comparisons of images. Conventional deep metric learning methods focus on learning a discriminative embedding to describe the semantic features of images, ignoring the uncertainty in each image that results from noise or semantic ambiguity. Training without awareness of these uncertainties causes the model to overfit the annotated labels during training and produce overconfident judgments during inference. Motivated by this, we argue that a good similarity model should consider the semantic discrepancies with awareness of the uncertainty to better deal with ambiguous images for more robust training. To achieve this, we propose to represent an image using not only a semantic embedding but also an accompanying uncertainty embedding, which describe the semantic characteristics and the ambiguity of an image, respectively. We further propose an introspective similarity metric to make similarity judgments between images considering both their semantic differences and their ambiguities. The gradient analysis of the proposed metric shows that it enables the model to learn at an adaptive and slower pace to deal with the uncertainty during training. Our framework attains state-of-the-art performance on the widely used CUB-200-2011, Cars196, and Stanford Online Products datasets for image retrieval. We further evaluate our framework for image classification on the ImageNet-1K, CIFAR-10, and CIFAR-100 datasets, which shows that equipping existing data mixing methods with the proposed introspective metric consistently achieves better results (e.g., +0.44% for CutMix on ImageNet-1K).
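
The introspective similarity metric combines a semantic embedding with an uncertainty embedding. The sketch below shows one plausible form of such a metric, penalizing both the semantic distance and the ambiguity of the two images; it is an assumption for illustration, not the paper's exact formulation.

```python
import torch

def introspective_similarity(mu_a, sigma_a, mu_b, sigma_b):
    """Plausible uncertainty-aware similarity: more distant or more ambiguous
    pairs receive lower similarity. Illustrative only."""
    semantic = (mu_a - mu_b).pow(2).sum(dim=-1)                      # semantic discrepancy
    ambiguity = sigma_a.pow(2).sum(dim=-1) + sigma_b.pow(2).sum(dim=-1)  # uncertainty terms
    return -(semantic + ambiguity)

mu_a, mu_b = torch.randn(8, 512), torch.randn(8, 512)
sig_a, sig_b = torch.rand(8, 512), torch.rand(8, 512)
print(introspective_similarity(mu_a, sig_a, mu_b, sig_b).shape)      # (8,)
```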

5.
IEEE Trans Pattern Anal Mach Intell ; 46(2): 1165-1180, 2024 Feb.
Article in English | MEDLINE | ID: mdl-37906482

ABSTRACT

In this paper, we propose a weakly-supervised approach for 3D object detection, which makes it possible to train a strong 3D detector with position-level annotations (i.e., annotations of object centers and categories). To remedy the information loss from box annotations to centers, our method uses synthetic 3D shapes to convert the position-level annotations into virtual scenes with box-level annotations, and in turn utilizes the fully-annotated virtual scenes to complement the real labels. Specifically, we first present a shape-guided label-enhancement method, which assembles 3D shapes into physically reasonable virtual scenes according to the coarse scene layout extracted from position-level annotations. We then transfer the information contained in the virtual scenes back to real ones by applying a virtual-to-real domain adaptation method, which refines the annotated object centers and additionally supervises the training of the detector with the virtual scenes. Since the shape-guided label-enhancement method generates virtual scenes from hand-crafted physical constraints, the layout of the fixed virtual scenes may be unreasonable for varied object combinations. To address this, we further present differentiable label enhancement to optimize the virtual scenes, including object scales, orientations, and locations, in a data-driven manner. Moreover, we propose a label-assisted self-training strategy to fully exploit the capability of the detector. By reusing the position-level annotations and virtual scenes, we fuse the information from both domains and generate box-level pseudo labels on the real scenes, which enables us to directly train a detector in a fully-supervised manner. Extensive experiments on the widely used ScanNet and Matterport3D datasets show that our approach surpasses current weakly-supervised and semi-supervised methods by a large margin, and achieves detection performance comparable to some popular fully-supervised methods with less than 5% of the labeling labor.

6.
IEEE Trans Pattern Anal Mach Intell ; 46(4): 2518-2532, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38019629

ABSTRACT

In this paper, we present a new framework named DIML to achieve more interpretable deep metric learning. Unlike traditional deep metric learning methods that simply produce a global similarity given two images, DIML computes the overall similarity through a weighted sum of multiple local part-wise similarities, making it easier for humans to understand how the model distinguishes two images. Specifically, we propose a structural matching strategy that explicitly aligns the spatial embeddings by computing an optimal matching flow between the feature maps of the two images. We also devise a multi-scale matching strategy, which considers both global and local similarities and can significantly reduce the computational cost in image retrieval applications. To handle the view variance in some complicated scenarios, we propose to use cross-correlation as the marginal distribution of the optimal transport, leveraging semantic information to locate the important regions in the images. Our framework is model-agnostic and can be applied to off-the-shelf backbone networks and metric learning methods. To extend DIML to more advanced architectures like vision Transformers (ViTs), we further propose truncated attention rollout and partial similarity to overcome the lack of locality in ViTs. We evaluate our method on three major benchmarks of deep metric learning, including CUB200-2011, Cars196, and Stanford Online Products, and achieve substantial improvements over popular metric learning methods with better interpretability.
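
The structural matching strategy computes an optimal matching flow between the spatial embeddings of two images. Below is a minimal sketch of that idea using entropy-regularized optimal transport (Sinkhorn iterations) with uniform marginals; the paper instead uses cross-correlation marginals and a multi-scale scheme, so this is only a simplified illustration.

```python
import torch
import torch.nn.functional as F

def sinkhorn(cost: torch.Tensor, n_iters: int = 20, eps: float = 0.1) -> torch.Tensor:
    """Entropy-regularized OT with uniform marginals (illustrative Sinkhorn loop)."""
    K = torch.exp(-cost / eps)                        # (N, M)
    u = torch.ones(cost.shape[0]) / cost.shape[0]
    v = torch.ones(cost.shape[1]) / cost.shape[1]
    a, b = u.clone(), v.clone()
    for _ in range(n_iters):
        a = u / (K @ b)
        b = v / (K.t() @ a)
    return a[:, None] * K * b[None, :]                # transport plan

def structural_similarity(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """Overall similarity as an OT-weighted sum of local cosine similarities.

    f1, f2: (C, H, W) spatial embeddings of the two images.
    """
    p1 = F.normalize(f1.flatten(1).t(), dim=-1)       # (HW, C)
    p2 = F.normalize(f2.flatten(1).t(), dim=-1)
    sim = p1 @ p2.t()                                 # part-wise similarities
    plan = sinkhorn(1.0 - sim)                        # match spatial locations
    return (plan * sim).sum()

print(structural_similarity(torch.randn(64, 7, 7), torch.randn(64, 7, 7)))
```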

7.
IEEE Trans Pattern Anal Mach Intell ; 46(5): 2981-2996, 2024 May.
Article in English | MEDLINE | ID: mdl-38015703

ABSTRACT

In this paper, we propose a dynamic 3D object detector named HyperDet3D, which is adaptively adjusted on the fly based on hyper scene-level knowledge. Existing methods strive for object-level representations of local elements and their relations without scene-level priors, and thus suffer from ambiguity between similarly structured objects when relying only on the understanding of individual points and object candidates. Instead, we design scene-conditioned hypernetworks to simultaneously learn scene-agnostic embeddings, which exploit sharable abstractions across various 3D scenes, and scene-specific knowledge, which adapts the 3D detector to the given scene at test time. As a result, the lower-level ambiguity in object representations can be addressed by hierarchical context in scene priors. However, since the upstream hypernetwork in HyperDet3D takes raw scenes as input, which contain noise and redundancy, it produces sub-optimal parameters for the 3D detector when trained only under the constraint of downstream detection losses. Based on the fact that the downstream 3D detection task can be factorized into object-level semantic classification and bounding box regression, we further propose HyperFormer3D by correspondingly designing scene-level prior tasks in the upstream hypernetworks, namely Semantic Occurrence and Objectness Localization. To this end, we design a transformer-based hypernetwork that translates the task-oriented scene priors into parameters of the downstream detector, which avoids the noise and redundancy of the raw scenes. Extensive experimental results on the ScanNet, SUN RGB-D and MatterPort3D datasets demonstrate the effectiveness of the proposed methods.
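
The scene-conditioned hypernetwork maps scene-level knowledge to the parameters of the downstream detector. The sketch below shows the generic mechanism, generating the weights of one per-point linear layer from a scene embedding; dimensions and module names are hypothetical and the transformer-based design of the paper is not reproduced.

```python
import torch
import torch.nn as nn

class SceneConditionedLayer(nn.Module):
    """Sketch of a hypernetwork-generated layer: a scene embedding is mapped to the
    weights of a per-point linear transform, so the layer adapts to each scene."""
    def __init__(self, in_dim: int = 128, out_dim: int = 128, scene_dim: int = 256):
        super().__init__()
        self.hyper = nn.Linear(scene_dim, in_dim * out_dim)   # produces the weights
        self.in_dim, self.out_dim = in_dim, out_dim

    def forward(self, point_feats: torch.Tensor, scene_embed: torch.Tensor) -> torch.Tensor:
        # point_feats: (B, N, in_dim); scene_embed: (B, scene_dim)
        W = self.hyper(scene_embed).view(-1, self.in_dim, self.out_dim)
        return torch.bmm(point_feats, W)                      # (B, N, out_dim)

layer = SceneConditionedLayer()
print(layer(torch.randn(2, 1024, 128), torch.randn(2, 256)).shape)  # (2, 1024, 128)
```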

8.
IEEE Trans Pattern Anal Mach Intell ; 45(12): 14114-14130, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37924200

ABSTRACT

In this paper, we propose a Transformer encoder-decoder architecture, called PoinTr, which reformulates point cloud completion as a set-to-set translation problem and employs a geometry-aware block to model local geometric relationships explicitly. The migration of Transformers enables our model to better learn structural knowledge and preserve detailed information for point cloud completion. Taking a step towards more complicated and diverse situations, we further propose AdaPoinTr by developing an adaptive query generation mechanism and designing a novel denoising task for point cloud completion. Coupling these two techniques enables us to train the model efficiently and effectively: we reduce training time (by 15x or more) and improve completion performance (by over 20%). Additionally, we propose two more challenging benchmarks with more diverse incomplete point clouds that better reflect real-world scenarios to promote future research. We also show that our method can be extended to scene-level point cloud completion by designing a new geometry-enhanced semantic scene completion framework. Extensive experiments on the existing and newly proposed datasets demonstrate the effectiveness of our method, which attains 6.53 CD on PCN, 0.81 CD on ShapeNet-55 and 0.392 MMD on real-world KITTI, surpassing other work by a large margin and establishing new state-of-the-art results on various benchmarks. Most notably, AdaPoinTr achieves this promising performance with higher throughput and fewer FLOPs than the previous best methods in practice.
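
The completion results above are reported in Chamfer distance (CD). For reference, a common bidirectional L2 Chamfer distance can be computed as in the sketch below; note that benchmarks differ in whether squared distances and scaling factors are used, so the exact convention may not match the numbers quoted here.

```python
import torch

def chamfer_distance(p1: torch.Tensor, p2: torch.Tensor) -> torch.Tensor:
    """Bidirectional (L2) Chamfer distance between two point sets."""
    d = torch.cdist(p1, p2)                               # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

pred = torch.rand(2048, 3)   # completed point cloud
gt = torch.rand(8192, 3)     # ground-truth point cloud
print(chamfer_distance(pred, gt))
```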

9.
IEEE Trans Image Process ; 32: 3759-3773, 2023.
Article in English | MEDLINE | ID: mdl-37405880

ABSTRACT

In this paper, we propose a discrepancy-aware meta-learning approach for zero-shot face manipulation detection, which aims to learn a discriminative model maximizing the generalization to unseen face manipulation attacks with the guidance of the discrepancy map. Unlike existing face manipulation detection methods that usually present algorithmic solutions to the known face manipulation attacks, where the same types of attacks are used to train and test the models, we define the detection of face manipulation as a zero-shot problem. We formulate the learning of the model as a meta-learning process and generate zero-shot face manipulation tasks for the model to learn the meta-knowledge shared by diversified attacks. We utilize the discrepancy map to keep the model focused on generalized optimization directions during the meta-learning process. We further incorporate a center loss to better guide the model to explore more effective meta-knowledge. Experimental results on the widely used face manipulation datasets demonstrate that our proposed approach achieves very competitive performance under the zero-shot setting.

10.
IEEE Trans Pattern Anal Mach Intell ; 45(11): 14005-14019, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37498756

ABSTRACT

Face clustering is a promising method for annotating unlabeled face images. Recent supervised approaches have greatly boosted face clustering accuracy; however, their performance is still far from satisfactory. These methods can be roughly divided into global-based and local-based ones. Global-based methods suffer from the limitation of training data scale, while local-based ones are inefficient at inference due to the use of numerous overlapping subgraphs. Previous approaches fail to tackle these two challenges simultaneously. To address the dilemma of large-scale training and efficient inference, we propose the STructure-AwaRe Face Clustering (STAR-FC) method. Specifically, we design a structure-preserving subgraph sampling strategy to explore the power of large-scale training data, which can increase the training data scale from 10^5 to 10^7. On this basis, a novel hierarchical GCN training paradigm is further proposed to better capture the dynamic local structure. During inference, STAR-FC performs efficient full-graph clustering in two steps: graph parsing and graph refinement. The concept of node intimacy is introduced in the second step to mine local structural information, and a calibration module is further proposed for fairer edge scores. STAR-FC achieves a pairwise F-score of 93.21 on standard partial MS1M within 312 seconds, which far surpasses the state of the art while maintaining high inference efficiency. Furthermore, we are the first to train on an ultra-large-scale graph with 20M nodes, and achieve superior inference results on 12M test data. Overall, as a simple and effective method, the proposed STAR-FC provides a strong baseline for large-scale face clustering.

11.
IEEE Trans Pattern Anal Mach Intell ; 45(11): 13621-13635, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37432799

ABSTRACT

In this paper, we propose Point-Voxel Correlation Fields to explore relations between two consecutive point clouds and estimate scene flow that represents 3D motions. Most existing works only consider local correlations, which are able to handle small movements but fail when there are large displacements. Therefore, it is essential to introduce all-pair correlation volumes that are free from local neighbor restrictions and cover both short- and long-term dependencies. However, it is challenging to efficiently extract correlation features from all-pairs fields in the 3D space, given the irregular and unordered nature of point clouds. To tackle this problem, we present point-voxel correlation fields, proposing distinct point and voxel branches to inquire about local and long-range correlations from all-pair fields respectively. To exploit point-based correlations, we adopt the K-Nearest Neighbors search that preserves fine-grained information in the local region, which guarantees the scene flow estimation precision. By voxelizing point clouds in a multi-scale manner, we construct pyramid correlation voxels to model long-range correspondences, which are utilized to handle fast-moving objects. Integrating these two types of correlations, we propose Point-Voxel Recurrent All-Pairs Field Transforms (PV-RAFT) architecture that employs an iterative scheme to estimate scene flow from point clouds. To adapt to different flow scope conditions and obtain more fine-grained results, we further propose Deformable PV-RAFT (DPV-RAFT), where the Spatial Deformation deforms the voxelized neighborhood, and the Temporal Deformation controls the iterative update process. We evaluate the proposed method on the FlyingThings3D and KITTI Scene Flow 2015 datasets and experimental results show that we outperform state-of-the-art methods by remarkable margins.
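
The point branch of the correlation fields gathers fine-grained local correlations via K-Nearest Neighbors search. The sketch below illustrates that step on an all-pair correlation volume; the feature dimensions and normalization are assumptions, and the voxel branch and iterative update are omitted.

```python
import torch

def knn_correlation(feat1: torch.Tensor, feat2: torch.Tensor,
                    xyz1: torch.Tensor, xyz2: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Point-branch sketch: for each point in frame 1, gather the correlation values
    of its k nearest neighbors in frame 2 (local, fine-grained term only).

    feat1, feat2: (N, C), (M, C) per-point features; xyz1, xyz2: (N, 3), (M, 3).
    """
    corr = feat1 @ feat2.t() / feat1.shape[1] ** 0.5     # all-pair correlation volume (N, M)
    dist = torch.cdist(xyz1, xyz2)                       # geometric distances
    idx = dist.topk(k, dim=1, largest=False).indices     # k nearest neighbors per point
    return torch.gather(corr, 1, idx)                    # (N, k) local correlation features

f1, f2 = torch.randn(4096, 64), torch.randn(4096, 64)
x1, x2 = torch.rand(4096, 3), torch.rand(4096, 3)
print(knn_correlation(f1, f2, x1, x2).shape)             # (4096, 16)
```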

12.
IEEE Trans Pattern Anal Mach Intell ; 45(10): 11689-11706, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37141057

ABSTRACT

Generative data-free quantization has emerged as a practical compression approach that quantizes deep neural networks to low bit-width without accessing the real data. It generates data using the batch normalization (BN) statistics of the full-precision networks and uses the synthetic data to quantize the networks. However, it often suffers from serious accuracy degradation in practice. We first give a theoretical analysis showing that the diversity of synthetic samples is crucial for data-free quantization, whereas in existing approaches the synthetic data, completely constrained by BN statistics, empirically exhibit severe homogenization at both the distribution and sample levels. This paper presents a generic Diverse Sample Generation (DSG) scheme for generative data-free quantization to mitigate this detrimental homogenization. We first slack the statistics alignment for features in the BN layer to relax the distribution constraint. Then, we strengthen the loss impact of specific BN layers for different samples and inhibit the correlation among samples in the generation process, diversifying samples from the statistical and spatial perspectives, respectively. Comprehensive experiments show that, for large-scale image classification tasks, our DSG consistently improves quantization performance across different neural architectures, especially under ultra-low bit-width. Moreover, the data diversification introduced by DSG brings a general gain to various quantization-aware training and post-training quantization approaches, demonstrating its generality and effectiveness.
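
The first step of DSG slacks the BN-statistics alignment to relax the distribution constraint. The sketch below shows one way to express such a relaxed alignment loss for a single BN layer, where deviations within a margin are not penalized; the margin value and the exact loss form are illustrative assumptions rather than the paper's objective.

```python
import torch
import torch.nn.functional as F

def slack_bn_loss(feat: torch.Tensor, running_mean: torch.Tensor,
                  running_var: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    """Relaxed BN-statistics alignment for one layer: deviations within `margin`
    are not penalized, which leaves room for sample diversity."""
    mean = feat.mean(dim=(0, 2, 3))                       # statistics of synthetic batch
    var = feat.var(dim=(0, 2, 3), unbiased=False)
    mean_err = F.relu((mean - running_mean).abs() - margin)
    var_err = F.relu((var - running_var).abs() - margin)
    return mean_err.mean() + var_err.mean()

feat = torch.randn(16, 64, 8, 8)                          # activations from synthetic images
print(slack_bn_loss(feat, torch.zeros(64), torch.ones(64)))
```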

13.
IEEE Trans Pattern Anal Mach Intell ; 45(9): 11040-11052, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37074897

ABSTRACT

Deep-learning-based fusion methods have achieved promising performance in image fusion tasks. This is attributed to the network architecture, which plays a very important role in the fusion process. However, it is generally hard to specify a good fusion architecture, and consequently the design of fusion networks is still a black art rather than a science. To address this problem, we formulate the fusion task mathematically and establish a connection between its optimal solution and the network architecture that can implement it. This approach leads to the novel method proposed in this paper for constructing a lightweight fusion network. It avoids the time-consuming empirical network design based on a trial-and-test strategy. In particular, we adopt a learnable representation approach to the fusion task, in which the construction of the fusion network architecture is guided by the optimisation algorithm producing the learnable model. The low-rank representation (LRR) objective is the foundation of our learnable model. The matrix multiplications, which are at the heart of the solution, are transformed into convolutional operations, and the iterative process of optimisation is replaced by a special feed-forward network. Based on this novel network architecture, an end-to-end lightweight fusion network is constructed to fuse infrared and visible light images. Its successful training is facilitated by a detail-to-semantic information loss function proposed to preserve the image details and to enhance the salient features of the source images. Our experiments show that the proposed fusion network exhibits better fusion performance than the state-of-the-art fusion methods on public datasets. Interestingly, our network requires fewer training parameters than other existing methods.

14.
IEEE Trans Pattern Anal Mach Intell ; 45(7): 8265-8283, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37018614

ABSTRACT

In this paper, we propose a deep metric learning with adaptively composite dynamic constraints (DML-DC) method for image retrieval and clustering. Most existing deep metric learning methods impose pre-defined constraints on the training samples, which might not be optimal at all stages of training. To address this, we propose a learnable constraint generator to adaptively produce dynamic constraints to train the metric towards good generalization. We formulate the objective of deep metric learning under a proxy Collection, pair Sampling, tuple Construction, and tuple Weighting (CSCW) paradigm. For proxy collection, we progressively update a set of proxies using a cross-attention mechanism to integrate information from the current batch of samples. For pair sampling, we employ a graph neural network to model the structural relations between sample-proxy pairs to produce the preservation probabilities for each pair. Having constructed a set of tuples based on the sampled pairs, we further re-weight each training tuple to adaptively adjust its effect on the metric. We formulate the learning of the constraint generator as a meta-learning problem, where we employ an episode-based training scheme and update the generator at each iteration to adapt to the current model status. We construct each episode by sampling two subsets of disjoint labels to simulate the procedure of training and testing and use the performance of the one-gradient-updated metric on the validation subset as the meta-objective of the assessor. We conducted extensive experiments on five widely used benchmarks under two evaluation protocols to demonstrate the effectiveness of the proposed framework.

15.
IEEE Trans Pattern Anal Mach Intell ; 45(9): 10835-10849, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37015126

ABSTRACT

In this work, we present a new multi-view depth estimation method, NerfingMVS, that utilizes both conventional reconstruction and learning-based priors over the recently proposed neural radiance fields (NeRF). Unlike existing neural-network-based optimization methods that rely on estimated correspondences, our method directly optimizes over implicit volumes, eliminating the challenging step of matching pixels in indoor scenes. The key to our approach is to utilize the learning-based priors to guide the optimization process of NeRF. Our system first adapts a monocular depth network to the target scene by finetuning on its MVS reconstruction from COLMAP. Then, we show that the shape-radiance ambiguity of NeRF still exists in indoor environments and propose to address the issue by employing the adapted depth priors to monitor the sampling process of volume rendering. Finally, a per-pixel confidence map acquired by computing the error on the rendered image can be used to further improve the depth quality. We further present NerfingMVS++, where a coarse-to-fine depth-prior training strategy is proposed to directly utilize sparse SfM points, and uniform sampling is replaced by Gaussian sampling to boost performance. Experiments show that our NerfingMVS and its extension NerfingMVS++ achieve state-of-the-art performance on the indoor datasets ScanNet and NYU Depth V2. In addition, we show that the guided optimization scheme does not sacrifice the original synthesis capability of neural radiance fields, improving the rendering quality on both seen and novel views. Code is available at https://github.com/weiyithu/NerfingMVS.
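
NerfingMVS++ replaces uniform ray sampling with Gaussian sampling around the depth priors. The sketch below shows a simplified version of such depth-guided sampling; the ray parameterization, the standard deviation, and the confidence weighting are assumptions for illustration and do not reproduce the released implementation.

```python
import torch

def depth_guided_samples(prior_depth: torch.Tensor, n_samples: int = 64,
                         sigma: float = 0.1) -> torch.Tensor:
    """Sample per-ray depths from a Gaussian centered at the adapted depth prior,
    instead of uniformly over the full near-far range."""
    eps = torch.randn(prior_depth.shape[0], n_samples)
    t = prior_depth[:, None] + sigma * eps                 # Gaussian around the prior
    return t.clamp(min=1e-3).sort(dim=1).values            # keep samples ordered per ray

prior = torch.full((1024,), 2.5)                           # e.g. depths from the adapted network
print(depth_guided_samples(prior).shape)                   # (1024, 64)
```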

16.
IEEE Trans Pattern Anal Mach Intell ; 45(7): 8813-8826, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37015428

ABSTRACT

In this article, we propose extremely low-precision vision transformers, called Quantformer, for efficient inference. Conventional network quantization methods directly quantize the weights and activations of fully-connected layers without considering the properties of transformer architectures. Quantization sizably deviates the self-attention from its full-precision counterpart, and a shared quantization strategy for diversely distributed patch features causes severe quantization errors. To address these issues, we enforce the self-attention rank in quantized transformers to mimic that of full-precision counterparts with a capacity-aware distribution for information retention, and quantize patch features with a group-wise discretization strategy to minimize quantization errors. Specifically, we efficiently preserve self-attention rank consistency by minimizing the distance between the self-attention in quantized and real-valued transformers with an adaptive concentration degree, where the optimal concentration degree is selected according to the self-attention entropy for model capacity adaptation. Moreover, we partition patch features in different dimensions with differentiable group assignment, so that features in different groups leverage different discretization strategies with minimal rounding and clipping errors. Experimental results show that our Quantformer outperforms state-of-the-art network quantization methods by a sizable margin across various vision transformer architectures in image classification and object detection. We also integrate our Quantformer with mixed-precision quantization to further enhance the performance of the vanilla models.
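
The group-wise discretization strategy quantizes differently distributed patch features with separate ranges. The sketch below illustrates that idea with hard, pre-computed channel groups and a shared symmetric quantizer per group; the paper learns the group assignment differentiably, so this is only a simplified stand-in.

```python
import torch

def groupwise_quantize(tokens: torch.Tensor, groups: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Quantize patch-token features with a separate scale per channel group, so
    differently distributed feature groups use different discretization ranges.

    tokens: (N, C) patch features; groups: (C,) integer group id per channel.
    """
    qmax = 2 ** (bits - 1) - 1
    out = torch.empty_like(tokens)
    for g in groups.unique():
        mask = groups == g
        scale = tokens[:, mask].abs().max().clamp(min=1e-8) / qmax
        out[:, mask] = torch.round(tokens[:, mask] / scale).clamp(-qmax, qmax) * scale
    return out

tokens = torch.randn(197, 768)                   # e.g. ViT patch + class tokens
groups = torch.randint(0, 8, (768,))             # hard group assignment per channel
print(groupwise_quantize(tokens, groups).shape)  # (197, 768)
```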

17.
IEEE Trans Pattern Anal Mach Intell ; 45(9): 10960-10973, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37030707

ABSTRACT

Recent advances in self-attention and pure multi-layer perceptron (MLP) models for vision have shown great potential in achieving promising performance with fewer inductive biases. These models are generally based on learning interactions among spatial locations from raw data. The complexity of self-attention and MLP grows quadratically as the image size increases, which makes these models hard to scale up when high-resolution features are required. In this paper, we present the Global Filter Network (GFNet), a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity. Our architecture replaces the self-attention layer in vision Transformers with three key operations: a 2D discrete Fourier transform, an element-wise multiplication between frequency-domain features and learnable global filters, and a 2D inverse Fourier transform. Based on this basic design, we develop a series of isotropic models with a Transformer-style simple architecture and CNN-style hierarchical models with better performance. Isotropic GFNet models exhibit favorable accuracy/complexity trade-offs compared to recent vision Transformers and pure MLP models. Hierarchical GFNet models can inherit successful designs from CNNs and be easily scaled up with larger model sizes and more training data, showing strong performance on both image classification (e.g., 85.0% top-1 accuracy on ImageNet-1K without any extra data or supervision, and 87.4% accuracy with ImageNet-21K pre-training) and dense prediction tasks (e.g., 54.3 mIoU on ADE20K val). Our results demonstrate that GFNet can be a very competitive alternative to Transformer-based models and CNNs in terms of efficiency, generalization ability and robustness. Code is available at https://github.com/raoyongming/GFNet.
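
The three key operations of the global filter layer (2D FFT, element-wise multiplication with learnable filters, inverse 2D FFT) can be expressed compactly in PyTorch, as in the sketch below; initialization and normalization details may differ from the official code at the linked repository.

```python
import torch
import torch.nn as nn

class GlobalFilter(nn.Module):
    """Core GFNet mixing operation: 2D FFT -> element-wise product with a learnable
    frequency-domain filter -> inverse 2D FFT."""
    def __init__(self, dim: int, h: int = 14, w: int = 8):
        super().__init__()
        # complex filter stored as (h, w, dim, 2); w = W // 2 + 1 because rfft2 is used
        self.filter = nn.Parameter(torch.randn(h, w, dim, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) spatial tokens
        B, H, W, C = x.shape
        X = torch.fft.rfft2(x, dim=(1, 2), norm='ortho')       # complex frequency features
        X = X * torch.view_as_complex(self.filter)             # global filtering in frequency
        return torch.fft.irfft2(X, s=(H, W), dim=(1, 2), norm='ortho')

layer = GlobalFilter(dim=64, h=14, w=8)
print(layer(torch.randn(2, 14, 14, 64)).shape)                 # (2, 14, 14, 64)
```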

18.
IEEE Trans Pattern Anal Mach Intell ; 45(9): 10883-10897, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37030709

ABSTRACT

In this paper, we present a new approach for model acceleration by exploiting spatial sparsity in visual data. We observe that the final prediction in vision Transformers is based only on a subset of the most informative regions, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically based on the input to accelerate vision Transformers. Specifically, we devise a lightweight prediction module to estimate the importance of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. While the framework is inspired by our observation of the sparse attention in vision Transformers, we find that the idea of adaptive and asymmetric computation can be a general solution for accelerating various architectures. We extend our method to hierarchical models, including CNNs and hierarchical vision Transformers, as well as more complex dense prediction tasks. To handle structured feature maps, we formulate a generic dynamic spatial sparsification framework with progressive sparsification and asymmetric computation for different spatial locations. By applying lightweight fast paths to less informative features and expressive slow paths to important locations, we can maintain the complete structure of feature maps while significantly reducing the overall computation. Extensive experiments on diverse modern architectures and different visual tasks demonstrate the effectiveness of our proposed framework. By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-35% and improves throughput by over 40%, while the accuracy drop is within 0.5% for various vision Transformers. By introducing asymmetric computation, a similar acceleration can be achieved on modern CNNs and Swin Transformers. Moreover, our method achieves promising results on more complex tasks including semantic segmentation and object detection. Our results clearly demonstrate that dynamic spatial sparsification offers a new and more effective dimension for model acceleration. Code is available at https://github.com/raoyongming/DynamicViT.
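
The lightweight prediction module scores tokens and prunes the redundant ones. The sketch below shows a hard, inference-style version that keeps the top-k patch tokens plus the class token; the training-time differentiable relaxation and the hierarchical placement across layers are omitted, so this is an illustration of the idea rather than the released module.

```python
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    """Lightweight scoring head that keeps the most informative tokens."""
    def __init__(self, dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.score = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C); the class token (index 0) is always kept
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        scores = self.score(patches).squeeze(-1)                         # (B, N-1)
        k = max(1, int(patches.shape[1] * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices                              # kept token indices
        kept = torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1]))
        return torch.cat([cls_tok, kept], dim=1)

pruner = TokenPruner(dim=384)
print(pruner(torch.randn(2, 197, 384)).shape)                            # (2, 138, 384)
```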

19.
IEEE Trans Pattern Anal Mach Intell ; 45(8): 9486-9503, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37022422

ABSTRACT

Existing image-based rendering methods usually adopt depth-based image warping to synthesize novel views. In this paper, we argue that the essential limitations of the traditional warping operation are its limited neighborhood and its purely distance-based interpolation weights. To this end, we propose content-aware warping, which adaptively learns the interpolation weights for pixels of a relatively large neighborhood from their contextual information via a lightweight neural network. Based on this learnable warping module, we propose a new end-to-end learning-based framework for novel view synthesis from a set of input source views, in which two additional modules, namely confidence-based blending and feature-assistant spatial refinement, are naturally proposed to handle the occlusion issue and to capture the spatial correlation among pixels of the synthesized view, respectively. We also propose a weight-smoothness loss term to regularize the network. Experimental results on light field datasets with wide baselines and on multi-view datasets show that the proposed method significantly outperforms state-of-the-art methods both quantitatively and visually. The source code is publicly available at https://github.com/MantangGuo/CW4VS.
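
Content-aware warping replaces fixed distance-based interpolation weights with weights predicted from contextual features. The sketch below illustrates this for a pre-gathered neighborhood of K candidate source pixels per target pixel; the feature dimension, neighborhood size, and weight network are illustrative assumptions, not the authors' module.

```python
import torch
import torch.nn as nn

class ContentAwareInterp(nn.Module):
    """Sketch of content-aware warping: a small network predicts interpolation
    weights for K candidate source pixels from their features, replacing fixed
    distance-based bilinear weights. Neighborhood gathering is assumed done upstream."""
    def __init__(self, feat_dim: int = 32, k: int = 9):
        super().__init__()
        self.weight_net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.k = k

    def forward(self, neighbor_colors: torch.Tensor, neighbor_feats: torch.Tensor) -> torch.Tensor:
        # neighbor_colors: (P, K, 3), neighbor_feats: (P, K, F) for P target pixels
        logits = self.weight_net(neighbor_feats).squeeze(-1)      # (P, K)
        w = torch.softmax(logits, dim=-1)                         # learned interpolation weights
        return (w.unsqueeze(-1) * neighbor_colors).sum(dim=1)     # (P, 3) synthesized colors

m = ContentAwareInterp()
print(m(torch.rand(4096, 9, 3), torch.randn(4096, 9, 32)).shape)  # (4096, 3)
```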


Subjects
Algorithms, Learning, Neural Networks (Computer), Software
20.
IEEE Trans Pattern Anal Mach Intell ; 45(2): 2193-2207, 2023 Feb.
Article in English | MEDLINE | ID: mdl-35294344

ABSTRACT

This work explores the use of the global and local structures of 3D point clouds as a free and powerful supervision signal for representation learning. Local and global patterns of a 3D object are closely related. Although each part of an object is incomplete, the underlying attributes of the object are shared among all parts, which makes reasoning about the whole object from a single part possible. We hypothesize that a powerful representation of a 3D object should model the attributes that are shared between parts and the whole object, and be distinguishable from other objects. Based on this hypothesis, we propose a new framework to learn point cloud representations by bidirectional reasoning between the local structures at different abstraction hierarchies and the global shape. Moreover, we extend the unsupervised structural representation learning method to more complex 3D scenes. By introducing structural proxies as intermediate-level representations between local and global ones, we propose a hierarchical reasoning scheme among local parts, structural proxies, and the overall point cloud to learn powerful 3D representations in an unsupervised manner. Extensive experimental results demonstrate that the unsupervised representations can be very competitive alternatives to supervised representations in discriminative power, and exhibit better generalization ability and robustness. Our method establishes a new state of the art in unsupervised/few-shot 3D object classification and part segmentation. We also show that our method can serve as a simple yet effective regime for model pre-training on 3D scene segmentation and detection tasks. We expect our observations to offer a new perspective on learning better representations from data structures instead of human annotations for point cloud understanding.
